K. Mader, M. Stampanoni (4quant.com/SRI2015)
fps1000 - (640 x 480 x 840 fps) for $400 \( \rightarrow \) 780 MB/s \( \rightarrow \) 66 TB/day
How the time spent on the component tasks of science has changed over the last years, and how we expect it to change by 2020.
The primary trends to focus on are the rapid decrease in measurement time and the increasing importance of post-processing.
Thank you Alessandra and Alberto!
The first challenge with these data is stitching them together.
Algorithmically it is very simple:
\( \textrm{Overlap}_{A \rightarrow B} = \)
\( \textrm{argmax}_{\vec{x}}\left[\mathcal{FT}^{-1} \left\{\mathcal{FT}\{A\}\mathcal{FT}\{B\}^{*}\right\} \right] \)
We need to stitch each scan with all of its neighbors
We have a few more parameters (\( \theta \)) and we know the initial positions so we optimize in a limited range rather than the entire image.
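As an illustration of this limited-range search, here is a minimal 1D sketch in plain Scala: a direct cross-correlation sum instead of the FFT formulation above, evaluating only shifts near the known initial position. The names (`OverlapSketch`, `bestShift`) are hypothetical and not part of any framework.

```scala
// Toy 1D version of the overlap search: slide B against A and keep
// the shift with the highest cross-correlation, restricted to a
// small window around an initial guess.
object OverlapSketch {
  def crossCorr(a: Array[Double], b: Array[Double], shift: Int): Double =
    a.indices.map { i =>
      val j = i + shift
      if (j >= 0 && j < b.length) a(i) * b(j) else 0.0
    }.sum

  // We know the approximate positions, so only search a limited range
  def bestShift(a: Array[Double], b: Array[Double],
                guess: Int, range: Int): Int =
    (guess - range to guess + range).maxBy(s => crossCorr(a, b, s))
}
```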
Now that our task is clear we just execute \( \rightarrow \)
The machine being used by the user, responsible for creating the jobs which need to run
Distributes tasks across multiple machines and coordinates communication between them
The nodes where the computation is actually done
tell machine 1 to load images 0 to 2
tell machine 2 to load images 3 to 5
share the images to calculate the overlap
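The splitting step the driver performs can be sketched as follows, assuming one contiguous chunk of image indices per machine; the `assignImages` helper is hypothetical.

```scala
// Toy sketch of the driver's job: split nImages indices into
// contiguous chunks, one per machine.
def assignImages(nImages: Int, nMachines: Int): Map[Int, Seq[Int]] =
  (0 until nImages)
    .grouped(math.ceil(nImages.toDouble / nMachines).toInt)
    .zipWithIndex
    .map { case (chunk, machine) => machine -> chunk }
    .toMap
```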
The most important job for any piece of analysis is to be correct.
Almost all image processing tasks require a number of people to evaluate and implement them and are almost always moving targets
The last of the major priorities is speed, which covers scalability, raw performance, and development time.
Untested Algorithms
Image Definitions
Mat: CV_8UC1 — a 1D array of unsigned bytes
Img<RealType<T> & NativeType<T>> — a 2D array, generically typed, with cursor access
itk::Image<TPixel, VImageDimension>
imread — could be 2D, maybe 3D, sometimes even 4D
numpy — types and dimensions only determined at runtime
You can analyze streaming tweets during World Cup matches to see goals before you see them on television
You can process a petabyte of clicks to optimize ad-campaigns
Powerful tools
Distributed computation
Interactive computation
When a ton of heterogeneous data is coming in fast.
Performant, scalable, and flexible
10X, 100X, 1000X is the same amount of effort
The director of the AMPLab said their rate-limiting factor is always having enough interesting data.
You conduct the orchestra, you don't spend time telling each person what to do
Big Data is a very popular term today. There is a ton of hype, and money seems to be thrown at any project which includes these words. First, to get over the biggest misconception: big data isn't about how much data you have, it is about how you work with it.
Before we can think about Big Data we need to define what it's replacing, Small Data.
The run-of-the-mill approach: using standard tools and lots of clicking, one image at a time
Elastic, On-Demand
These frameworks are really cool and Spark has a big vocabulary, but flatMap, filter, aggregate, join, groupBy, and fold still do not sound like anything I want to do to an image.
I want to
We have developed a number of commands in SIL for handling standard image processing tasks
Fully extensible with
New detectors certainly provide a substantial amount of new data, but what about old data?
Look at this new technique from famous journal, let's try it on all of our old data
It seems to give the same results
Well what about this technique?
allData = loadPSIImages("s3://psi-data/femur-bones/*.tif") ++
loadAPSImages("s3://aps-data/bone-images/*.hdf5") ++
loadESRFImages("s3://esrf-data/femur-bones/*.tif")
allData.registerImageMetaTable("BoneImages")
How many of the bone samples measured had a cortical thickness larger than 30mm? \[ \downarrow \]
SELECT COUNT(*) FROM (
SELECT THICKNESS(
SEGMENT_CORTICAL(
GAUSSFILT(imgData)
)
) AS thk FROM BoneImages
) WHERE thk > 30
Is there a correlation between cortical thickness and mineral content?
\[ \downarrow \]
SELECT THICKNESS(cortical),
IMG_AVG(cortical) FROM (
SELECT SEGMENT_CORTICAL(
GAUSSFILT(imgData)
) AS cortical FROM BoneImages
)
CREATE TABLE FilteredBones AS
SELECT boneId,GaussianFilter(image) FROM BoneImages
CREATE TABLE ThresholdImages AS
SELECT boneId,ApplyThreshold(image,OTSU) FROM FilteredBones
CREATE TABLE CellImages AS
SELECT boneId,ComponentLabel(image) FROM ThresholdImages
CREATE TABLE CellAnalysis AS
SELECT boneId,AnalyzeShape(CellImages) FROM CellImages GROUP BY boneId
WRITE TABLE CellAnalysis "s3n://bones/cells.csv"
The query is processed by the SQL parsing engine
In the biological imaging community, the open source tools of ImageJ2 and Fiji are widely accepted and have a large number of readily available plugins and tools.
We can integrate the functionality directly into Spark and perform operations on much larger datasets than a single machine could have in memory. Additionally these analyses can be performed on streaming data.
val wr = new WebcamReceiver()
val ssc = sc.toStreaming(strTime)
val imgList = ssc.receiverStream(wr)
val filtImgs = imgList.
mapValues(_.run("Median...","radius=3"))
val totImgs = inImages.count()
val bgImage = inImages.
reduce(_ add _).multiply(1.0/totImgs)
val eventImages = filtImgs.
transform{
inImages =>
val corImage = inImages.map {
case (inTime,inImage) =>
val corImage = inImage.subtract(bgImage)
(corImage.getImageStatistics().mean,
(inTime,corImage))
}
corImage
}
eventImages.filter(iv => Math.abs(iv._1)>20).
foreachRDD(showResultsStr("outlier",_))
A spinoff - 4Quant: Big Image Analytics
While our framework is commercial, we build on top of and integrate into open-source tools so that the research is reproducible by anyone, anywhere, using any machine (it just might take a bit longer)
We also try to publish as much code, educational material, and samples as possible on our github account.

Big Data Finite Element Analysis using GraphX and Spark
Here we use a KNIME-based workflow and our Spark Imaging Layer extensions to create a workflow without any Scala or programming knowledge, with an easily visible flow from one block to the next and without the performance overhead of using other tools.
We are interested in partnerships and collaborations
First we start our cluster:
spark-ec2.py -s 50 launch 4quant-image-cluster
Load all of the samples (56TB of data)
loadImage(
"s3://../brain_*_*_*/rec.tif"
)
\[ \downarrow \] A Resilient Distributed Dataset (RDD) \[ \textrm{Images}: \textrm{RDD}[((x,y,z),Img[Double])] =\\ \left[(\vec{x},\textrm{Img}),\cdots\right] \]
filteredImages = Images.gaussianFilter(1,0.5)
volFraction = Images.threshold(0.5).
map{keyImg =>
(sum(keyImg.img),keyImg.size)
}.reduce((a, b) => (a._1 + b._1, a._2 + b._2))
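Stripped of Spark, the same volume-fraction computation can be sketched with plain Scala collections: threshold each image, count foreground voxels, and sum the per-image tallies. The `volumeFraction` helper is hypothetical, not part of the framework.

```scala
// Plain-Scala sketch of the volume-fraction pipeline above:
// threshold each image, then reduce (foreground, total) voxel counts.
def volumeFraction(images: Seq[Array[Double]], cutoff: Double): Double = {
  val (fg, tot) = images
    .map(img => (img.count(_ > cutoff).toDouble, img.length.toDouble))
    .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
  fg / tot
}
```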
Developing in Scala brings additional flexibility through types[1]. In microscopy the standard formats are 2-, 3- and even 4- or more dimensional arrays or matrices, which can be iterated through quickly using CPU and GPU code. While that is still possible in Scala, there is a great deal more flexibility in data types, allowing anything to be stored as an image and then processed, as long as the basic functions make sense.
[1] Fighting Bit Rot with Types (Experience Report: Scala Collections), M Odersky, FSTTCS 2009, December 2009
A collection of positions and values, maybe more (not an array of double). Arrays are efficient for storing in computer memory, but often a poor way of expressing scientific ideas and analyses.
combine information from nearby pixels
determine groups of pixels which are very similar to desired result
trait BasicMathSupport[T] extends Serializable {
def plus(a: T, b: T): T
def times(a: T, b: T): T
def scale(a: T, b: Double): T
def negate(a: T): T = scale(a,-1)
def invert(a: T): T
def abs(a: T): T
def minus(a: T, b: T): T = plus(a, negate(b))
def divide(a: T, b: T): T = times(a, invert(b))
def compare(a: T, b: T): Int
}
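To show how little is needed per type, a minimal `Double` instance might look like this (the instance name is hypothetical). Only the abstract methods are implemented; `minus` and `divide` come for free from the trait's defaults.

```scala
// Repeating the trait from the slide so this snippet stands alone.
trait BasicMathSupport[T] extends Serializable {
  def plus(a: T, b: T): T
  def times(a: T, b: T): T
  def scale(a: T, b: Double): T
  def negate(a: T): T = scale(a, -1)
  def invert(a: T): T
  def abs(a: T): T
  def minus(a: T, b: T): T = plus(a, negate(b))
  def divide(a: T, b: T): T = times(a, invert(b))
  def compare(a: T, b: T): Int
}

// Hypothetical Double instance: only the abstract methods are needed.
implicit val DoubleBMS: BasicMathSupport[Double] =
  new BasicMathSupport[Double] {
    def plus(a: Double, b: Double) = a + b
    def times(a: Double, b: Double) = a * b
    def scale(a: Double, b: Double) = a * b
    def invert(a: Double) = 1.0 / a
    def abs(a: Double) = math.abs(a)
    def compare(a: Double, b: Double) = a.compareTo(b)
  }
```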
Simple filter implementation
def SimpleFilter[T](inImage: Image[T])
    (implicit wst: BasicMathSupport[T]) = {
  val width: Double = 1
  val kernel = (pos: D3int, value: T) =>
    wst.scale(value, exp(-pow(pos.mag / width, 2)))
  val kernelReduce = (ptA: T, ptB: T) =>
    wst.scale(wst.plus(ptA, ptB), 0.5)
  runFilter(inImage, kernel, kernelReduce)
}
Spectra as well-supported types
implicit val SpectraBMS = new BasicMathSupport[Array[Double]] {
  def plus(a: Array[Double], b: Array[Double]) =
    a.zip(b).map { case (x, y) => x + y }
  ...
  def scale(a: Array[Double], b: Double) =
    a.map(_ * b)
}
We have all of the filtered images and we want to stitch them together in a smart way.
pairImages = Images.
  cartesian(Images).
  filter { case (im1, im2) => dist(im1.pos, im2.pos) < 1 }
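The same pairing step can be sketched without Spark using plain collections: a cartesian product of tile positions filtered down to neighbouring pairs. The `neighbourPairs` helper is a hypothetical stand-in for the RDD version.

```scala
// Cartesian product of tile positions, keeping only pairs
// of distinct tiles that lie within maxDist of each other.
def neighbourPairs(positions: Seq[(Int, Int)],
                   maxDist: Double): Seq[((Int, Int), (Int, Int))] =
  for {
    a <- positions
    b <- positions
    if a != b
    dx = a._1 - b._1
    dy = a._2 - b._2
    if math.sqrt(dx * dx + dy * dy) <= maxDist
  } yield (a, b)
```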
The cross correlation can then be executed on each pair of images from the new RDD (pairImages) by using the map command
displacementField = pairImages.
  map { case ((posA, imgA), (posB, imgB)) =>
    xcorr(imgA, imgB, in = posB - posA)
  }
From the updated information provided by the cross correlations, a smoothed displacement field can be computed by applying appropriate smoothing criteria across windows.
smoothField = displacementField.
window(3,3,3).
map(gaussianSmoothFunction)
This also ensures the original data is left unaltered and all analysis is reversible.
The final stitching can then be done by
alignImages.
  filter { case (x, img) => abs(x - tPos) < img.size }.
  map { case (x, img) =>
    new Image(tSize).copy(img, x, tPos)
  }.combineImages()
If you looked at one 1000 x 1000 sized image every second, it would take you
139
hours to browse through a terabyte of data.
| Year | Time to 1 TB | Man power to keep up | Salary Costs / Month |
|---|---|---|---|
| 2000 | 4096 min | 2 people | 25 kCHF |
| 2008 | 1092 min | 8 people | 95 kCHF |
| 2014 | 32 min | 260 people | 3255 kCHF |
| 2016 | 2 min | 3906 people | 48828 kCHF |
The standard approach to producing data is top-down.
In a study with 80 samples with a variability of 10% and a 5% difference between them, we can afford only 1 test and maintain the commonly accepted \( p<0.05 \)
We thus need smart people making and choosing the best models so that we can preserve the significance.
The Big Data era suggests a new approach: given an abundance of data, we can afford to run many tests.
In a study with 800 samples with a variability of 10% and a 5% difference between them, we can afford over 10M tests and maintain the commonly accepted \( p<0.05 \)
For knowledge creation this means a switch
\[ \textrm{Transistors} \propto 2^{T/(\textrm{18 months})} \]
Based on data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d
There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?
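As a quick sanity check of the scaling law above, doubling every 18 months means a 4x increase in transistor count over 3 years; the helper function below is just an illustration of the formula.

```scala
// Transistors ∝ 2^(T / 18 months): growth multiplier after a
// given number of months.
def transistorGrowth(months: Double): Double =
  math.pow(2, months / 18.0)
```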
The range of cloud costs (determined by peak usage) compared to a local workstation with utilization shown as the average number of hours the computer is used each week.
1.5 man-years
For many traditional scientific studies, the sample count is fairly low (10-100); for such low counts, we can look at the costs (estimated from our experience with cortical bone microstructure and rheology).
As studies get bigger (>100 samples), we see that these costs begin to radically shift.
The bottleneck is the filesystem connection: many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl.
One of the central tenets of MapReduce™ is data-centric computation \( \rightarrow \) instead of moving data to the computation, move the computation to the data.
Combining many different components together inside the Spark Shell, IPython, or Zeppelin makes it easier to assemble workflows.